Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Nov 19, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Error Error Nov 20, 2025 8:52pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 20, 2025 8:52pm
rivet-inspector Ignored Ignored Preview Nov 20, 2025 8:52pm
rivet-site Ignored Ignored Preview Nov 20, 2025 8:52pm

Copy link
Member Author

NathanFlurry commented Nov 19, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Nov 19, 2025

Pull Request Review: Remove TunnelAck

This PR removes the acknowledgement mechanism (TunnelAck) from the tunnel protocol and simplifies the message flow between pegboard-gateway and pegboard-runner. Here's my analysis:

✅ Strengths

1. Protocol Simplification

  • Removes bidirectional ack complexity, making the system easier to understand and maintain
  • Eliminates state tracking for active requests on the runner side (tunnel_active_requests HashMap removed from conn.rs)
  • Reduces message overhead by removing ack round trips

2. Architectural Improvement

  • Introduces gateway_id field in tunnel messages, allowing runners to route responses directly to the correct gateway without maintaining state
  • Gateway-to-runner routing becomes stateless from the runner's perspective
  • Cleaner separation: gateway handles in-flight state, runner just forwards messages

3. Code Quality

  • Removes ~100+ lines of complex ack handling logic
  • Eliminates the need to track request lifecycle states
  • Better protocol versioning with new ToRunner and ToGateway union types

⚠️ Concerns and Recommendations

1. TODO Comments (Shared State - lines 228-249)
The ToGatewayKeepAlive handler is stubbed out with a large TODO block containing old TunnelAck logic. This is confusing:

Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ...
}

Recommendation: Either implement the keep-alive mechanism or remove the TODO if it's not needed. The commented code references TunnelAck which no longer exists.

2. Message Acknowledgement Timeout Logic (shared_state.rs:447-453)
The GC logic checks for message timeouts but the condition appears inverted:

if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked{...});
}

This triggers a timeout when the message is within the timeout window, not beyond it. Should be >= instead of <=.

Same issue on line 458-462 for WebSocket messages.

3. Missing Request Cleanup on Runner Disconnect
Previously, when the runner disconnected, active requests received abort messages (removed from lib.rs:246-276). Now there's no cleanup mechanism when runners disconnect unexpectedly.

Question: How does the gateway handle runner disconnections? Does the workflow system handle this, or could requests hang indefinitely?

4. Protocol Version Conversion Edge Cases (versioned.rs:1231-1236 and 1331-1338)
The v2→v3 conversion maps deprecated TunnelAck to RequestAbort or ResponseAbort:

v2::ToClientTunnelMessageKind::TunnelAck => {
    // This shouldn't happen in practice as TunnelAck was removed
    v3::ToClientTunnelMessageKind::ToClientRequestAbort
}

While the comment says this shouldn't happen, if a v2 client sends TunnelAck, it will be interpreted as an abort, potentially breaking active connections.

Recommendation: Add logging or explicit error handling for this case to aid debugging during the migration period.

5. Hibernation State Management
The pending_msgs vector (line 179-182) is still populated but never cleared since TunnelAck removal. This could cause:

  • Memory leaks for long-lived connections
  • False positive timeouts in GC

Question: What's the intended lifecycle for pending_msgs now? Should it be removed entirely?

6. TypeScript Protocol Missing Index Field
In the v2→v3 conversion (versioned.rs:1219), the TypeScript protocol defaulted index to 0, but now uses msg.index. Verify the TypeScript SDK properly handles this field.

🔍 Testing Recommendations

  1. Load testing - Verify the system handles high message volumes without the ack backpressure mechanism
  2. Failure scenarios - Test runner disconnections during active requests
  3. Protocol migration - Ensure v2 clients gracefully upgrade to v3
  4. Memory profiling - Verify pending_msgs doesn't cause leaks
  5. Hibernation flow - Test WebSocket hibernation/resume with pending messages

📝 Minor Issues

  • Line 165: Gateway ID should use self.gateway_id.as_bytes() without dereferencing (though this works, it's cleaner)
  • Documentation: Consider updating protocol docs to explain the new stateless routing model

Summary

This is a solid architectural improvement that simplifies the protocol. The main concerns are around the incomplete keep-alive implementation, the inverted timeout logic, and potential memory leaks from pending_msgs. Once these are addressed, this should be good to merge.

Priority Issues:

  1. Fix timeout condition logic (lines 449 and 458)
  2. Clarify pending_msgs lifecycle and GC behavior
  3. Implement or remove ToGatewayKeepAlive TODO

@MasterPtato MasterPtato force-pushed the 11-19-chore_engine-runner_remove_echo_close_on_toclientwebsocketclose branch from 972242d to 556b074 Compare November 19, 2025 21:19
@MasterPtato MasterPtato force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from 4adef6f to 759df3e Compare November 19, 2025 21:19
@NathanFlurry NathanFlurry force-pushed the 11-19-chore_engine-runner_remove_echo_close_on_toclientwebsocketclose branch from 556b074 to 972242d Compare November 19, 2025 21:22
@NathanFlurry NathanFlurry force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from 759df3e to 4adef6f Compare November 19, 2025 21:22
@claude
Copy link

claude bot commented Nov 19, 2025

PR Review: Remove TunnelAck

Summary

This PR removes the TunnelAck acknowledgment mechanism from the pegboard gateway/runner protocol and simplifies the tunnel message flow by embedding the gateway ID directly in messages, eliminating the need for stateful request tracking.

Architecture Changes

Key Improvements ✅

  1. Simplified Message Flow: The removal of TunnelAck eliminates bidirectional acknowledgment complexity, reducing network overhead and latency.

  2. Stateless Gateway Routing: By embedding gateway_id in ToServerTunnelMessage, the runner can now route messages without maintaining active request state (tunnel_active_requests HashMap removed).

  3. Protocol Evolution: Introduces ToRunner union type to support future gateway-to-runner specific messages like ToRunnerKeepAlive (currently stubbed with TODO).

Code Quality & Best Practices

Strengths ✅

  • Follows CLAUDE.md conventions: Structured logging with tracing, proper error handling with anyhow
  • Backward compatibility: Version conversion functions handle the deprecated TunnelAck gracefully for v2→v3 migration
  • Reduced complexity: Removed 150+ lines of stateful tracking logic
  • Type safety: Protocol changes properly reflected across Rust and TypeScript SDKs

Concerns ⚠️

1. Incomplete Feature Implementation

// pegboard-gateway/src/shared_state.rs:227-249
Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ...
}

Issue: The ToGatewayKeepAlive handler is completely commented out, leaving pending message cleanup unimplemented. This could lead to unbounded memory growth in in_flight.pending_msgs.

Impact: Without acknowledgment cleanup, the gateway will accumulate pending messages indefinitely until requests complete, potentially exhausting memory under high load.

Recommendation: Either implement the keepalive mechanism or remove the pending message tracking entirely if it's no longer needed.

2. Hardcoded Default Values in Backward Compatibility

// runner-protocol/src/versioned.rs:165-169
v2::ToClient::ToClientTunnelMessage(msg) => {
    v3::ToClient::ToClientTunnelMessage(v3::ToClientTunnelMessage {
        gateway_id: [0; 16],  // ⚠️ Zero UUID for v2 messages
        // ...

Issue: V2 messages get a zero UUID for gateway_id, which could cause routing failures if v2 runners are still active.

Recommendation: Either:

  • Document that v2 protocol is fully deprecated and incompatible with this change
  • Store the actual gateway ID in a way that's accessible during deserialization
  • Add validation to reject v2 messages if they can't be properly routed

3. Silent Failure Mode

// runner-protocol/src/versioned.rs:1232-1236
v2::ToClientTunnelMessageKind::TunnelAck => {
    // TunnelAck is deprecated and should not be used
    // For backwards compatibility, we skip it
    v3::ToClientTunnelMessageKind::ToClientRequestAbort
}

Issue: Converting deprecated TunnelAck to ToClientRequestAbort silently changes semantics, potentially causing unexpected request termination.

Recommendation: Log a warning when this conversion occurs, or consider returning an error to surface the protocol mismatch.

Performance Considerations

Positive Impact ✅

  • Reduced message overhead: No more ack roundtrips for every tunnel message
  • Memory efficiency: Removed per-request TunnelActiveRequest HashMap in runner
  • Simpler code paths: Fewer branches in message handling

Potential Issues ⚠️

  • Pending message accumulation: Without proper ack cleanup (see concern [SVC-2555] Set up issue templates #1), memory usage could grow unbounded
  • Missing reliability mechanism: If the previous TunnelAck was used for flow control or reliability, its removal could impact message delivery guarantees

Security Concerns

Low Risk ✅

No immediate security issues identified. The gateway_id is a fixed-size 16-byte array, preventing injection attacks.

Consideration 💭

  • Ensure that gateway_id validation happens at the protocol boundary to prevent routing to unintended gateways
  • The zero UUID fallback for v2 messages (concern [SVC-2479] Send cluster events to PostHog #2) could theoretically allow routing confusion if not handled carefully

Test Coverage

Missing ⚠️

No test files found for pegboard-gateway or pegboard-runner packages. This is a significant protocol change that should have:

  1. Unit tests for message serialization/deserialization
  2. Integration tests verifying the new routing mechanism
  3. Tests for backward compatibility with v2 protocol
  4. Tests for the keepalive mechanism (once implemented)

Recommendation: Add test coverage before merging, especially for:

  • Gateway ID routing correctness
  • Pending message cleanup behavior
  • Protocol version conversion edge cases

Recommendations

Before Merging 🔴

  1. Implement or remove keepalive handling: Address the TODO in shared_state.rs:227
  2. Clarify v2 compatibility: Document or fix the zero gateway ID issue
  3. Add test coverage: At minimum, test the happy path and version conversion

Post-Merge Considerations 🟡

  1. Monitor memory usage in production for pending message accumulation
  2. Consider adding metrics for message routing failures
  3. Update any external documentation referencing TunnelAck

Overall Assessment

This is a well-structured refactoring that simplifies the codebase and removes unnecessary complexity. However, the incomplete keepalive implementation and backward compatibility concerns should be addressed before merging to production.

Status: ⚠️ Approve with changes requested


Review generated with Claude Code - following rivet-dev/rivet conventions from CLAUDE.md

@claude
Copy link

claude bot commented Nov 19, 2025

Code Review - PR #3488: Remove TunnelAck

Summary

This PR removes the TunnelAck acknowledgment mechanism from the tunnel protocol and simplifies the message flow between pegboard-gateway and pegboard-runner. The change introduces a new ToRunner message type and adds a gateway_id field to tunnel messages to enable direct routing without maintaining stateful request tracking.


Positive Aspects

Protocol Simplification: Removing TunnelAck eliminates a complex acknowledgment protocol, reducing code complexity significantly.

Reduced State Management: The PR removes tunnel_active_requests HashMap and associated lifecycle tracking logic (~80 lines removed from conn.rs and lib.rs).

Cleaner Architecture: Using gateway_id for direct message routing is more straightforward than maintaining request-to-subject mappings.

Backwards Compatibility: The versioning logic properly handles the deprecated TunnelAck in v2→v3 conversions (lines 1232-1237, 1320-1327 in versioned.rs).


Issues & Concerns

🔴 Critical: Commented-Out TODO Code

Location: pegboard-gateway/src/shared_state.rs:227-249

There's a large block of commented-out code handling ToGatewayKeepAlive:

Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ...
}

Issues:

  1. This suggests incomplete implementation - the keep-alive mechanism appears to be planned but not implemented
  2. The commented code references removing messages from pending_msgs, but this functionality is now lost
  3. No explanation of when/how this will be completed
  4. The pending_msgs vector at line 179-182 is still being populated but never cleaned up

Recommendation: Either:

  • Implement the keep-alive mechanism properly in this PR, or
  • File a follow-up issue and add a TODO comment with the issue number, or
  • Remove the commented code if it's truly not needed

🟡 Medium: Potential Memory Leak

Location: pegboard-gateway/src/shared_state.rs:179-182

req.pending_msgs.push(PendingMessage {
    message_id,
    send_instant: now,
});

Issue: Messages are added to pending_msgs but the commented-out ToGatewayKeepAlive handler that would remove them is not implemented. The GC logic at lines 447-453 checks for timeouts but the condition at line 448 appears inverted:

if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked{...});
}

This should likely be > not <= - it triggers GC when the message is not timed out, which seems backwards.

Recommendation: Fix the GC logic condition or clarify the intended behavior with comments.

🟡 Medium: Protocol Version Handling

Location: versioned.rs:1232-1237, 1320-1327

The conversion of deprecated TunnelAck to ToClientRequestAbort / ToServerResponseAbort could cause issues:

v2::ToClientTunnelMessageKind::TunnelAck => {
    // For backwards compatibility, we skip it
    // This shouldn't happen in practice as TunnelAck was removed
    v3::ToClientTunnelMessageKind::ToClientRequestAbort
}

Issues:

  1. Comment says "we skip it" but actually converts to RequestAbort - this will abort the request
  2. If v2 clients send TunnelAck, it will incorrectly terminate their requests
  3. No testing for mixed-version scenarios

Recommendation: Either truly skip the message (return early) or add integration tests for protocol version mixing.

🟡 Medium: Missing Error Handling

Location: ws_to_tunnel_task.rs:368-369

let gateway_reply_to = GatewayReceiverSubject::new(Uuid::from_bytes(msg.gateway_id)).to_string();

Issue: The gateway_id is assumed to be valid, but there's no validation. If it's zero/invalid (as seen in the default value at line 165 of shared_state.rs and versioned.rs lines 178, 493), messages could be routed incorrectly.

Recommendation: Add validation or document why [0; 16] is safe for backwards compatibility.

🟢 Minor: TypeScript Schema Generation

Location: runner-protocol/src/index.ts

The generated TypeScript looks correct, but the large diff (184 additions, 49 deletions) suggests significant schema changes.

Recommendation: Ensure TypeScript clients are updated and tested. Consider documenting the breaking changes in a migration guide.

🟢 Minor: Cleanup Opportunity

Location: pegboard-runner/src/lib.rs:244-276

Removed code that sent close messages to active requests. This cleanup is good, but verify that request cleanup now happens correctly through other mechanisms.


Security Concerns

No major security issues identified. The removal of TunnelAck doesn't appear to introduce vulnerabilities. However:

  • Ensure the gateway_id routing cannot be exploited to intercept messages
  • The zero-value gateway_id in backward compatibility code should be validated

Performance Considerations

Positive: Removing tunnel_active_requests HashMap reduces memory overhead and lock contention.

⚠️ Concern: The unbounded pending_msgs vector (if not properly GC'd) could grow indefinitely.


Testing

Missing: No test coverage visible in this PR for:

  • Mixed protocol versions (v2 ↔ v3)
  • Keep-alive mechanism
  • Gateway ID routing
  • Pending message GC

Recommendation: Add integration tests before merging.


Recommendations Summary

Before merging:

  1. ✅ Implement or remove the commented ToGatewayKeepAlive handler
  2. ✅ Fix the GC logic condition at line 448 in shared_state.rs
  3. ✅ Add validation for gateway_id
  4. ✅ Add tests for protocol version compatibility
  5. ⚠️ Document the breaking changes for TypeScript/runner clients

Nice to have:

  • Add metrics for pending message queue sizes
  • Document the new message flow in architecture docs

Conclusion

This is a solid refactoring that simplifies the protocol, but has some implementation gaps that need addressing before merging. The main concern is the incomplete keep-alive mechanism and potential memory leak from unbounded pending_msgs. Once those are resolved, this will be a good improvement to the codebase.

Overall Assessment: ⚠️ Needs Work - Address critical TODOs and GC logic before merging.

@claude
Copy link

claude bot commented Nov 19, 2025

PR Review: Remove TunnelAck

This PR removes the TunnelAck acknowledgment mechanism from the runner protocol and replaces it with a gateway-ID-based message routing system. Overall, the changes look good and simplify the protocol, but I've identified several important issues to address.


🔴 Critical Issues

1. Missing gateway_id in TypeScript SDK message sending

Location: engine/sdks/typescript/runner/src/tunnel.ts:430-437

The TypeScript #sendMessage method creates a ToServerTunnelMessage but does not include the required gateway_id field. This will cause protocol deserialization failures on the Rust side.

// Current code - MISSING gatewayId
const message: protocol.ToServer = {
    tag: "ToServerTunnelMessage",
    val: {
        requestId,
        messageId,
        messageKind,  // gateway_id is missing here!
    },
};

Fix needed: Add gatewayId field to the message. The gateway ID should be extracted from the incoming ToClientTunnelMessage and stored in the tunnel or actor state.

2. Gateway ID not tracked in TypeScript SDK

Location: engine/sdks/typescript/runner/src/tunnel.ts

The TypeScript runner needs to:

  1. Extract gatewayId from incoming ToClientTunnelMessage messages
  2. Store it per-request (likely in the actor's request tracking)
  3. Include it when sending ToServerTunnelMessage back

Currently, there's no mechanism to capture or store the gateway ID from incoming messages.


⚠️ Warnings & Potential Issues

3. Incomplete TODO comments

Location: engine/packages/pegboard-gateway/src/shared_state.rs:227-249

Large commented-out block for ToGatewayKeepAlive handling with a TODO: comment. This suggests incomplete implementation:

  • What's the plan for keep-alive messages?
  • Should pending message tracking be removed entirely, or is this temporary?
  • The commented code references message_id but ToGatewayKeepAlive is defined as void in the schema

Recommendation: Either implement the keep-alive mechanism or remove the TODO comment and clarify the design decision.

4. Pending message cleanup may not work correctly

Location: engine/packages/pegboard-gateway/src/shared_state.rs:179-182

The pending_msgs vector is still being populated but never cleared since TunnelAck removal. This means:

  • The pending_msgs vector will grow indefinitely for long-lived requests
  • The GC logic at line 447-453 checks pending message timeouts but uses comparison operators that will timeout messages (condition on line 448-450 is backwards)
// Line 447-453 - Bug in condition
if let Some(earliest_pending_msg) = req.pending_msgs.first() {
    if now.duration_since(earliest_pending_msg.send_instant)
        <= MESSAGE_ACK_TIMEOUT  // Should be >= not <=
    {
        break 'reason Some(MsgGcReason::MessageNotAcked{...});
    }
}

This will cause the GC to trigger timeouts for messages that are still fresh!

5. Protocol versioning compatibility concerns

Location: engine/sdks/rust/runner-protocol/src/versioned.rs:1231-1237, 1305-1312

The backwards compatibility converters map removed TunnelAck to ToClientRequestAbort / ToServerResponseAbort. This is a questionable fallback:

  • Acks are not semantically equivalent to aborts
  • May cause unexpected behavior if v2 clients/gateways try to communicate with v3 endpoints
// Lines 1231-1237
v2::ToClientTunnelMessageKind::TunnelAck => {
    // This shouldn't happen in practice as TunnelAck was removed
    v3::ToClientTunnelMessageKind::ToClientRequestAbort
}

Recommendation: Consider making this a hard error or at minimum adding metrics/logging to detect if this fallback is ever hit in production.


✅ Good Practices

  1. Clean state management simplification - Removing the complex TunnelActiveRequest tracking from conn.rs simplifies the runner state management significantly.

  2. Gateway ID approach - Using the gateway ID for routing is more scalable than maintaining per-request state on both sides.

  3. Protocol schema changes - The BARE schema updates correctly define the new GatewayId type and update both message structures.

  4. Rust error handling - Continues to use anyhow::Result and proper context as per CLAUDE.md guidelines.


📋 Minor Suggestions

  1. Logging consistency: The log messages follow the structured logging pattern correctly (lowercase, using ? for debug). Good adherence to CLAUDE.md.

  2. Consider adding metrics: This is a significant protocol change. Consider adding metrics for:

    • Gateway ID mismatches
    • Unknown gateway IDs
    • Message routing failures
  3. WebSocket hibernation: The change impacts how WebSocket messages are tracked during hibernation. Ensure the hibernation flow is thoroughly tested with the new gateway-id-based routing.


🧪 Testing Recommendations

Before merging, please ensure:

  1. Integration tests covering gateway ↔ runner message flow
  2. WebSocket hibernation tests with the new routing
  3. Protocol version compatibility tests (v2 ↔ v3)
  4. Load tests to verify the GC properly cleans up pending messages

Summary

Overall assessment: The architectural change is sound, but there are critical bugs that will cause runtime failures:

  • ❌ Missing gateway_id in TypeScript SDK will break all tunnel messages from TypeScript runners
  • ❌ GC timeout logic is inverted and will incorrectly timeout fresh messages
  • ⚠️ Incomplete keep-alive implementation needs clarification

Recommendation: Address the critical issues before merging. The TypeScript SDK bug is a blocker.

@claude
Copy link

claude bot commented Nov 19, 2025

Pull Request Review: Remove TunnelAck

Summary

This PR removes the TunnelAck acknowledgement mechanism from the pegboard tunnel protocol, simplifying the communication flow between gateway and runner. The change eliminates the bidirectional acknowledgement system in favor of a unidirectional message flow with a gateway_id field for routing.

Code Quality & Best Practices

✅ Strengths

  1. Good Protocol Evolution: The versioned protocol conversion handles backward compatibility properly by mapping deprecated TunnelAck variants to appropriate fallback messages in versioned.rs:1231-1238 and 1331-1338.

  2. Consistent Naming: The new ToRunner and ToGateway union types follow existing conventions and improve clarity about message direction.

  3. Proper Logging: Structured logging patterns are correctly used throughout (e.g., tracing::debug!(?x, "message") instead of string formatting).

  4. Simplified State Management: Removing tunnel_active_requests HashMap from pegboard-runner/src/conn.rs:32-33 eliminates complex lifecycle tracking and potential memory leaks.

⚠️ Areas for Improvement

  1. Incomplete TODO Comments (shared_state.rs:227-249):

    Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
        // TODO:
        // let prev_len = in_flight.pending_msgs.len();
        // ...

    Issue: The ToGatewayKeepAlive handler is stubbed out with commented code. This suggests the keep-alive acknowledgement mechanism is unfinished.

    Recommendation: Either:

    • Implement the keep-alive logic before merging, or
    • Add a detailed TODO with a tracking issue reference explaining what needs to be done
    • Document whether this is safe to leave incomplete and what the runtime implications are
  2. Gateway ID Initialization (versioned.rs:493 and 681):

    gateway_id: [0; 16],  // Default to all zeros for v2 messages

    Issue: When converting v2 messages to v3, gateway_id defaults to zeros. This could cause routing issues if older protocol versions are still in use.

    Recommendation: Document whether v2 protocol is still supported in production. If yes, add a comment explaining how zero gateway_id is handled in routing logic.

  3. Error Handling in GC Logic (shared_state.rs:447-453):

    if let Some(earliest_pending_msg) = req.pending_msgs.first() {
        if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT {
            break 'reason Some(MsgGcReason::MessageNotAcked{message_id:Uuid::from_bytes(earliest_pending_msg.message_id)});
        }
    }

    Issue: The comparison operator <= appears incorrect - this would trigger timeout immediately. Should be >=.

    Critical Bug: This will cause requests to be garbage collected prematurely!

  4. Same Issue in WebSocket GC (shared_state.rs:458-460):

    if now.duration_since(earliest_pending_ws_msg.send_instant) <= HWS_MESSAGE_ACK_TIMEOUT {
        break 'reason Some(MsgGcReason::WebSocketMessageNotAcked{last_ws_msg_index: hs.last_ws_msg_index});
    }

    Critical Bug: Same logic error with <= instead of >=.

  5. Removed Cleanup Logic (pegboard-runner/src/lib.rs:246-277):
    The PR removes cleanup code that sent ToServerResponseAbort messages for active HTTP requests when the WebSocket closes.

    Concern: How are in-flight HTTP requests now cleaned up on runner disconnect? The comment says WebSockets are handled differently, but what about HTTP requests?

    Recommendation: Verify that gateway-side timeout handling properly cleans up orphaned HTTP requests.

Potential Bugs

🔴 Critical

  1. GC Timeout Comparison Bug (as noted above in points 3 & 4):

    // WRONG: Triggers immediately
    if now.duration_since(send_instant) <= TIMEOUT
    
    // CORRECT: Should be
    if now.duration_since(send_instant) >= TIMEOUT

    Impact: All pending messages will be incorrectly flagged for garbage collection immediately.

    Fix: Change both occurrences from <= to >=.

⚠️ Medium

  1. Incomplete Keep-Alive Implementation:
    The ToGatewayKeepAlive and ToRunnerKeepAlive types are defined but not fully implemented. Without keep-alive acknowledgements, how is connection health monitored?

  2. Migration Path Unclear:
    The PR changes protocol semantics significantly. Is there a deployment plan to ensure gateways and runners are upgraded atomically? What happens if v3 gateway talks to v2 runner?

Performance Considerations

✅ Improvements

  1. Reduced Message Overhead: Eliminating TunnelAck messages cuts bidirectional traffic in half for tunnel communication.

  2. Simplified Memory Management: Removing tunnel_active_requests HashMap eliminates per-request state tracking overhead.

  3. Direct Routing: Using gateway_id for direct subject construction (GatewayReceiverSubject::new(Uuid::from_bytes(msg.gateway_id))) is more efficient than dynamic reply-to tracking.

💡 Suggestions

  1. Consider whether pending_msgs in shared_state.rs:179-182 still serves a purpose now that acknowledgements are removed. If it's only for GC timeout, document this clearly.

Security Concerns

✅ No Critical Issues

No obvious security vulnerabilities introduced.

💡 Considerations

  1. Gateway ID Validation: Should ws_to_tunnel_task.rs:368-369 validate that the gateway_id exists before constructing the receiver subject? Otherwise, messages could be sent to arbitrary NATS subjects.

  2. Zero Gateway ID: The v2 compatibility code uses zero gateway_id. Ensure this doesn't allow message spoofing or routing to unintended gateways.

Test Coverage

❌ Concerns

  1. No Test Updates: The PR modifies core protocol behavior but doesn't include test updates or new tests.

  2. No Integration Tests: Given this changes gateway-runner communication, integration tests verifying the new flow would be valuable.

  3. GC Logic Untested: The timeout comparison bug suggests the GC logic may not have adequate test coverage.

Recommendation: Add tests covering:

  • Gateway-to-runner message flow without acknowledgements
  • GC timeout logic for pending messages
  • Protocol version conversion for v2→v3 migration
  • Keep-alive message handling (once implemented)

Additional Observations

  1. TypeScript SDK Changes: The TypeScript protocol updates in runner-protocol/src/index.ts properly remove TunnelAck from union tag ordering, maintaining serialization compatibility.

  2. Clean Deletion: The removal of is_to_server_tunnel_message_kind_request_close() helper function is appropriate since terminal state tracking is no longer needed.

  3. Documentation: The BARE schema (v3.bare) is well-documented with comments explaining the purpose of new types.

Recommendations Summary

Must Fix Before Merge

  • Fix GC timeout comparison operators (lines 449 and 459 in shared_state.rs)
  • Implement or remove ToGatewayKeepAlive handling (remove TODO or implement functionality)
  • Document v2 protocol compatibility (explain zero gateway_id handling)

Should Address

  • Add integration tests for new message flow
  • Verify HTTP request cleanup on runner disconnect
  • Document migration/deployment plan for protocol change
  • Add validation for gateway_id before routing

Nice to Have

  • Add tests for GC logic
  • Document whether pending_msgs is still needed
  • Consider adding metrics for message flow monitoring

Conclusion

This is a solid simplification of the tunnel protocol that removes unnecessary complexity. However, there are critical bugs in the GC timeout logic that must be fixed before merging. The incomplete keep-alive implementation also needs resolution.

Overall assessment: Needs work before merge due to timeout comparison bugs, but the architectural direction is sound.


Review generated with Claude Code

@MasterPtato MasterPtato force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from 4adef6f to 759df3e Compare November 19, 2025 21:54
@MasterPtato MasterPtato force-pushed the 11-19-chore_engine-runner_remove_echo_close_on_toclientwebsocketclose branch from 972242d to 556b074 Compare November 19, 2025 21:54
@claude
Copy link

claude bot commented Nov 19, 2025

Code Review: Remove TunnelAck

Summary

This PR removes the TunnelAck message type from the runner protocol and simplifies the communication flow between the pegboard gateway and runner. The change eliminates the bidirectional acknowledgment mechanism for tunnel messages.


✅ Positive Observations

  1. Significant Code Simplification: The PR successfully removes ~405 lines while adding 374, resulting in a net reduction and cleaner architecture. The removal of tunnel_active_requests tracking state machine simplifies the codebase considerably.

  2. Protocol Versioning Handled Correctly: The changes properly handle backward compatibility in versioned.rs by converting deprecated TunnelAck messages to appropriate fallback types (ToClientRequestAbort/ToServerResponseAbort) for older protocol versions.

  3. Gateway ID Introduction: The addition of gateway_id field to tunnel messages (lines 165, 283 in v3.bare) allows the runner to determine the correct gateway to send responses to without maintaining state. This is a cleaner design pattern.

  4. Consistent Error Handling: The code properly uses structured logging with tracing's field syntax (e.g., tracing::debug!(?x, "foo")) as per CLAUDE.md guidelines.


⚠️ Issues & Concerns

1. Incomplete TODO Comments (High Priority)

Location: pegboard-gateway/src/shared_state.rs:227-249 and tunnel_to_ws_task.rs:56-58

Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ... commented out ack handling logic
}

Issue: The ToGatewayKeepAlive handler is completely stubbed out with commented code. This suggests the feature is incomplete.

Questions:

  • What is the purpose of ToGatewayKeepAlive? Is this meant to replace TunnelAck functionality?
  • Why is the pending message cleanup commented out?
  • Should this be implemented before merging, or is this intentionally deferred?

Recommendation: Either implement the keepalive handling or document why it's deferred in the PR description.


2. Potential Message Ordering Issues (Medium Priority)

Location: pegboard-gateway/src/shared_state.rs:179-182

The code maintains a pending_msgs queue that tracks unacknowledged messages:

req.pending_msgs.push(PendingMessage {
    message_id,
    send_instant: now,
});

However, with TunnelAck removed, these pending messages are never removed from the queue except during GC timeouts (line 447-452).

Concerns:

  • The pending_msgs vector will continuously grow for long-lived requests
  • Memory leak potential for high-throughput connections
  • The GC logic checks if duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT which seems backwards (should be > not <=)

Line 448-450:

if now.duration_since(earliest_pending_msg.send_instant)
    <= MESSAGE_ACK_TIMEOUT
{
    break 'reason Some(MsgGcReason::MessageNotAcked{...});
}

This condition triggers GC when the message is within the timeout window, which is the opposite of the intended behavior.

Recommendation:

  • Fix the GC timeout logic (change <= to >=)
  • Consider removing pending_msgs entirely if acks are no longer needed, or implement the keepalive mechanism to actually clear them

3. Removed Cleanup Logic Not Replaced (Medium Priority)

Location: pegboard-runner/src/lib.rs:246-276 (deleted code)

The PR removes cleanup code that sends close messages to active requests when the runner websocket disconnects:

// Send close messages to all remaining active requests
let active_requests = conn.tunnel_active_requests.lock().await;
for (request_id, req) in &*active_requests {
    // ... send ToServerResponseAbort
}

Issue: With this removal, when a runner disconnects, the gateway may not be notified that in-flight HTTP requests were terminated.

Recommendation: Verify that the gateway properly handles runner disconnections and cleans up in-flight requests without relying on explicit close messages. Document this behavior if intentional.


4. TypeScript Client Impact (Low-Medium Priority)

Location: engine/sdks/typescript/runner/src/tunnel.ts

The TypeScript changes are extensive (87 deletions, 36 additions). The diff shows removal of pendingMessages tracking but I cannot see the full implementation.

Questions:

  • Does the TypeScript runner have equivalent issues with message tracking?
  • Are there TypeScript tests covering the tunnel message flow?

Recommendation: Ensure TypeScript runner has test coverage for the new protocol behavior.


5. Protocol Version Migration (Low Priority)

Location: runner-protocol/src/versioned.rs:1232-1238 and 1331-1338

The fallback behavior for deprecated TunnelAck converts it to Abort messages:

v2::ToClientTunnelMessageKind::TunnelAck => {
    // This shouldn't happen in practice as TunnelAck was removed
    v3::ToClientTunnelMessageKind::ToClientRequestAbort
}

Issue: The comment "This shouldn't happen in practice" suggests this is dead code, but it's still reachable if v2 clients send TunnelAck.

Recommendation: Consider whether this should log a warning or if the conversion to Abort is the correct behavior.


🔒 Security Considerations

No security vulnerabilities identified. The changes don't introduce:

  • Command injection risks
  • SQL injection (no database queries modified)
  • XSS vulnerabilities
  • Improper authentication/authorization changes

The gateway_id is properly typed as a 16-byte array (UUIDv4) which is appropriate.


🚀 Performance Considerations

Positive:

  • Removing the ack mechanism reduces message overhead (2 messages per tunnel message → 1 message)
  • Less state tracking should improve memory usage

Concern:

  • The pending message memory leak issue mentioned above could degrade performance over time

🧪 Test Coverage

Missing Information: The PR doesn't show test changes.

Recommendations:

  1. Add tests for ToGatewayKeepAlive / ToRunnerKeepAlive behavior
  2. Test that gateway_id is correctly populated and used for routing
  3. Test backward compatibility with v2 protocol clients
  4. Test runner disconnection cleanup scenarios

📝 Documentation & Code Style

Adherence to CLAUDE.md: ✅ Generally good

  • Imports are at the top of files ✅
  • Uses structured logging correctly ✅
  • Uses workspace dependencies ✅
  • Hard tabs are preserved (rustfmt.toml) ✅

Issues:

  • TODO comments should either be implemented or have tracking issues
  • The PR description is empty - should explain the rationale for removing TunnelAck

Recommendations Summary

Must Fix Before Merge (Breaking Issues):

  1. ❗ Fix the GC timeout logic at shared_state.rs:448-450 (inverted condition)
  2. ❗ Implement or document the ToGatewayKeepAlive handler
  3. ❗ Verify runner disconnection cleanup works correctly without explicit close messages

Should Fix Before Merge:

  1. Add test coverage for the new protocol behavior
  2. Add PR description explaining the rationale and design decisions
  3. Consider removing pending_msgs tracking if acks are truly no longer needed

Nice to Have:

  1. Document the migration path for existing deployments
  2. Add warning logs for deprecated TunnelAck conversions

Overall, this is a good simplification of the protocol, but there are a few critical issues that should be addressed before merging. The inverted timeout condition is a blocking bug, and the incomplete keepalive implementation needs clarification.

@claude
Copy link

claude bot commented Nov 19, 2025

PR Review: Remove TunnelAck

Overview

This PR removes the TunnelAck message type from the runner protocol, simplifying the acknowledgment mechanism between the pegboard gateway and runner. This is a significant protocol change that affects bidirectional message flow.


✅ Strengths

  1. Good Protocol Versioning: The change properly maintains backwards compatibility through the versioned protocol system. The conversion functions in versioned.rs handle v2 → v3 upgrades gracefully by mapping the removed TunnelAck to terminal states.

  2. Cleaner Architecture: Removing the explicit ack mechanism simplifies the state machine:

    • Removed tunnel_active_requests HashMap tracking
    • Removed complex ack send/receive logic
    • Cleaner separation between gateway and runner responsibilities
  3. New ToRunner Message Type: The introduction of ToRunner as a separate type from ToClient provides better type safety and future extensibility with ToRunnerKeepAlive.

  4. Gateway ID Addition: Adding gateway_id to tunnel messages enables better routing without maintaining stateful reply-to subjects.


🔍 Issues & Concerns

1. Incomplete Keep-Alive Implementation

Location: pegboard-gateway/src/shared_state.rs:227-250

The ToGatewayKeepAlive handling is stubbed out with TODO comments:

Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ...
}

Concern: The commented code suggests this was previously handling ack logic for pending messages. Without this:

  • How are pending messages being garbage collected now?
  • Could this lead to memory leaks in in_flight.pending_msgs?
  • The MESSAGE_ACK_TIMEOUT and pending_msgs fields still exist but may not be used correctly

Recommendation: Either complete the keep-alive implementation or remove the dead code (pending_msgs, timeouts, etc.) if the ack mechanism is truly unnecessary.


2. Stale TypeScript Code

Location: engine/sdks/typescript/runner/src/stringify.ts:35-36, 77-78

The stringify functions still handle TunnelAck:

case "TunnelAck":
    return "TunnelAck";

Issue: This creates a mismatch between the protocol definition (v3 removes TunnelAck) and the client SDK code. While the versioning layer handles conversion, the TypeScript code should be updated to:

  1. Remove TunnelAck cases from v3 stringify functions
  2. Add handling for new ToRunnerKeepAlive messages

Recommendation: Update TypeScript SDK to match protocol v3 exactly. Consider having separate stringify functions per protocol version if needed.


3. Backwards Compatibility Edge Case

Location: engine/sdks/rust/runner-protocol/src/versioned.rs:1231-1237, 1331-1338

The conversion functions map deprecated TunnelAck to abort states:

v2::ToClientTunnelMessageKind::TunnelAck => {
    // TunnelAck is deprecated and should not be used
    // For backwards compatibility, we skip it
    // This shouldn't happen in practice as TunnelAck was removed
    v3::ToClientTunnelMessageKind::ToClientRequestAbort
}

Concern:

  • The comment says "we skip it" but the code actually converts to Abort - this is inconsistent
  • If an old v2 runner sends a TunnelAck, it will be misinterpreted as a request abort
  • Could cause active requests to be terminated incorrectly during a rolling upgrade

Recommendation:

  • Either truly "skip" the message (return an Option or filter it out)
  • Or document clearly that v2 runners are no longer supported
  • Add integration tests for mixed-version scenarios

4. Missing Error Handling

Location: pegboard-runner/src/tunnel_to_ws_task.rs:56-59

protocol::ToRunner::ToRunnerKeepAlive(_) => {
    // TODO:
    continue;
}

Issue: Keep-alive messages are silently ignored. If keep-alives are meant to maintain connection health, there should be:

  • Logging at trace/debug level
  • Potential response mechanism
  • Timeout tracking

Recommendation: At minimum, add a trace log. Better yet, implement proper keep-alive handling or remove the message type if unused.


5. Protocol Version Not Bumped in Migration Code

Location: engine/sdks/rust/runner-protocol/src/versioned.rs:175-183

When converting v2 → v3, the code sets gateway_id: [0; 16]:

v3::ToClient::ToClientTunnelMessage(v3::ToClientTunnelMessage {
    gateway_id: [0; 16],  // ← Zero gateway ID for v2 messages
    request_id: msg.request_id,
    // ...
})

Concern: A zero UUID could cause routing issues if the gateway ID is used for message routing. The code in ws_to_tunnel_task.rs:368 uses this gateway_id to construct the reply subject:

let gateway_reply_to = GatewayReceiverSubject::new(Uuid::from_bytes(msg.gateway_id)).to_string();

Recommendation:

  • Document that v2 protocol is unsupported, OR
  • Find a way to populate the correct gateway_id during conversion, OR
  • Add validation to reject zero gateway IDs

6. Removed Connection Cleanup Logic

Location: pegboard-runner/src/lib.rs:246-278 (deleted code)

The PR removes the cleanup logic that sent abort messages to active requests when the runner websocket closes:

// Send close messages to all remaining active requests
// ... [deleted 37 lines]

Concern: Without this cleanup:

  • How does the gateway know when a runner disconnects?
  • Could lead to dangling requests waiting indefinitely
  • The comment mentions "Websockets are not ephemeral like requests" but what about HTTP requests?

Recommendation: Verify that the gateway has alternative mechanisms to detect runner disconnection and clean up in-flight requests. This might be handled elsewhere, but it should be documented or tested.


🔒 Security Considerations

  1. Gateway ID Validation: Ensure gateway_id is validated before use to prevent potential routing to unintended gateways
  2. No obvious injection vulnerabilities - the changes don't introduce user-controlled data into critical paths

⚡ Performance Considerations

Positive impacts:

  • Removed HashMap lock contention on tunnel_active_requests
  • Eliminated ack message round-trips (lower latency)
  • Simplified message processing pipeline

Potential concerns:

  • If keep-alive mechanism is not properly implemented, could lead to connection timeouts
  • Pending message tracking (pending_msgs) still exists but may not be pruned correctly

🧪 Test Coverage

Missing:

  • No visible tests for protocol v2 → v3 conversion
  • No integration tests for mixed-version gateway/runner scenarios
  • No tests verifying proper handling of dropped connections without active request cleanup

Recommendation: Add tests for:

  1. Protocol conversion with TunnelAck messages
  2. Gateway ID routing correctness
  3. Connection cleanup behavior
  4. Keep-alive message handling

📋 Summary

This is a well-intentioned simplification of the protocol, but it has several incomplete areas that need addressing before merge:

Must Fix:

  1. ✅ Complete or remove the keep-alive TODO sections
  2. ✅ Update TypeScript SDK to match protocol v3
  3. ✅ Clarify backwards compatibility story for v2 TunnelAck conversion
  4. ✅ Verify connection cleanup still works without active request tracking

Should Fix:
5. Add test coverage for protocol conversion
6. Document the zero gateway_id behavior for v2 messages
7. Add logging for keep-alive handling

Nice to Have:
8. Integration tests for mixed-version scenarios
9. Performance testing to verify the removed ack mechanism doesn't impact reliability


💡 Recommendations

  1. Short term: Complete the TODOs or remove dead code to make the intent clear
  2. Medium term: Add comprehensive tests for the protocol changes
  3. Long term: Consider documenting the protocol evolution and migration strategy for operators

Overall, the architectural direction is sound, but the implementation feels incomplete. The TODOs suggest this might have been a partial revert or work-in-progress that needs finishing touches.

@claude
Copy link

claude bot commented Nov 20, 2025

Pull Request Review

This PR removes the TunnelAck mechanism from the pegboard-gateway/runner communication protocol. I've identified several important issues and areas for improvement:

🔴 Critical Issues

1. Breaking Protocol Change Without Migration Path

  • The PR removes TunnelAck from the v3 protocol but the conversion functions in versioned.rs handle it incorrectly
  • Lines 1231-1238 and 1331-1338 convert TunnelAck to ToClientRequestAbort/ToServerResponseAbort with a comment saying "this shouldn't happen in practice"
  • Problem: If any v2 clients/runners exist, they could send TunnelAck messages that get converted to abort messages, potentially breaking active connections
  • Recommendation: Either ensure all clients are upgraded first, or handle the conversion more gracefully (e.g., ignore/skip the message)

2. Incomplete TODO Comments

  • Lines 227-250 in shared_state.rs and lines 54-57 in tunnel_to_ws_task.rs have TODO comments with commented-out code
  • The TODO in shared_state.rs suggests there was logic for handling pending messages that needs to be addressed
  • Recommendation: Either implement the TODOs or remove them if they're no longer needed. Leaving incomplete TODOs in production code creates technical debt

3. Missing Error Handling for WebSocket Message Index

  • Line 1220 in versioned.rs now uses msg.index when converting v2 to v3, but v2 ToClientWebSocketMessage may not have had this field
  • The comment on line 1219 was removed: "Default to 0 for v2 messages (hibernation disabled by default)"
  • Problem: This could panic or produce incorrect behavior if v2 messages don't have the index field
  • Recommendation: Verify v2 schema has this field, or add a default value

⚠️ Moderate Issues

4. State Management Simplification May Hide Issues

  • The PR removes tunnel_active_requests tracking entirely from the runner (conn.rs, lib.rs)
  • Previously, the runner tracked active requests and sent close messages on disconnection (lines 246-276 in lib.rs)
  • Impact: If a runner disconnects, gateways may not be notified about incomplete HTTP requests (WebSockets are explicitly skipped)
  • Recommendation: Verify that request timeout/cleanup is handled elsewhere (e.g., gateway-side timeouts)

5. Gateway ID Field Always Set to Zero for Legacy Messages

  • Lines 177 and 493 in versioned.rs set gateway_id: [0; 16] when converting v2 messages to v3
  • Problem: This means any v2 messages will be routed to gateway "00000000-0000-0000-0000-000000000000"
  • Recommendation: Either ensure no v2 clients exist, or add validation/error handling for zero gateway IDs

6. Message Routing Changed from Reply-To to Gateway ID

  • In ws_to_tunnel_task.rs (lines 365-373), the runner now derives gateway_reply_to from msg.gateway_id instead of looking it up from active requests
  • This is cleaner but requires the gateway to always populate gateway_id correctly
  • Recommendation: Add validation that gateway_id is non-zero before using it

✅ Positive Changes

  1. Simplified Architecture: Removing the ack mechanism simplifies the state management significantly
  2. Cleaner Message Flow: Using gateway_id in messages is more explicit than managing reply-to subjects
  3. Reduced Memory Overhead: No longer need to track tunnel_active_requests HashMap
  4. Type Safety: Introduction of ToRunner enum separates runner-specific messages from client messages

📋 Minor Issues

7. Inconsistent Logging

  • Line 35 in shared_state.rs: Uses structured logging correctly with ?gateway_id
  • Good adherence to CLAUDE.md logging patterns

8. Code Style

  • Hard tabs are used correctly (per rustfmt.toml)
  • Imports are at the top of files as per guidelines
  • No glob imports from anyhow (good!)

🧪 Testing Recommendations

  1. Backwards Compatibility: Test that v2 runners/clients can still communicate (or verify they're all upgraded)
  2. Gateway ID Validation: Test behavior when gateway_id is all zeros
  3. Connection Cleanup: Verify that incomplete requests are cleaned up when runners disconnect
  4. Message Ordering: Ensure the removal of acks doesn't cause message ordering issues

📝 Documentation

The PR description is empty. Consider adding:

  • Why TunnelAck is being removed
  • What replaces the acknowledgment mechanism (if anything)
  • Migration notes for deployments
  • Testing performed

Summary

This is a significant protocol change that simplifies the architecture. However, there are critical issues around backwards compatibility and incomplete TODOs that must be addressed before merging. The main concerns are:

  1. Handle v2→v3 conversion of TunnelAck more gracefully
  2. Complete or remove TODO comments
  3. Verify msg.index field exists in v2 schema
  4. Ensure request cleanup happens on runner disconnection
  5. Validate non-zero gateway_id values

Recommendation: Request changes to address the critical issues before merging.

@claude
Copy link

claude bot commented Nov 20, 2025

Code Review

I've reviewed PR #3488 which removes the TunnelAck mechanism from the pegboard protocol. Overall, this is a solid simplification that reduces protocol complexity. Here are my findings:

Summary

This PR removes the bidirectional acknowledgment system (TunnelAck) from the tunnel protocol between the pegboard-gateway and pegboard-runner. Instead of explicit ACKs, the gateway now uses a gateway_id to route messages back to the correct gateway instance.

✅ What's Done Well

Protocol Design

  • Clean removal of the TunnelAck message type from both ToServerTunnelMessageKind and ToClientTunnelMessageKind
  • Introduction of GatewayId type provides a cleaner routing mechanism
  • New ToRunner union type properly separates gateway-specific messages from client messages
  • ToGateway is now a proper union instead of a wrapper struct

Backwards Compatibility

  • Good handling of the TunnelAck deprecation in versioned.rs (lines 1231-1237, 1325-1332)
  • Converting v2 TunnelAck to ToClientRequestAbort/ToServerResponseAbort is a reasonable fallback
  • Protocol version handling allows gradual migration

Code Simplification

  • Removed complex state tracking in tunnel_active_requests HashMap from conn.rs
  • Eliminated 100+ lines of ACK handling logic from shared_state.rs
  • Removed cleanup logic for active requests on WS close (lib.rs:243-276)
  • TypeScript tunnel.ts simplified by removing gatewayReplyTo tracking

⚠️ Issues & Concerns

1. Critical: Incomplete Keep-Alive Implementation

The PR introduces ToRunnerKeepAlive and ToGatewayKeepAlive but leaves them unimplemented with // TODO: comments:

// shared_state.rs:227-250
Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // [commented out ack handling code]
}
// tunnel_to_ws_task.rs:56-59
protocol::ToRunner::ToRunnerKeepAlive(_) => {
    // TODO:
    continue;
}

Recommendation: Either implement the keep-alive mechanism or remove these message types entirely in this PR. Leaving them as TODOs creates dead code paths and unclear intent.

2. Potential Bug: Logic Inversion in ws_to_tunnel_task.rs

The gateway_id is now extracted from the incoming message to determine where to send replies:

// ws_to_tunnel_task.rs:368-369
let gateway_reply_to =
    GatewayReceiverSubject::new(Uuid::from_bytes(msg.gateway_id)).to_string();

However, when converting from v2 to v3 protocol, the gateway_id is set to zeros:

// versioned.rs:177-178, 492-493
v3::ToClient::ToClientTunnelMessage(v3::ToClientTunnelMessage {
    gateway_id: [0; 16],  // ⚠️ Will route to wrong gateway!
    // ...
})

Impact: This means v2 clients won't be able to receive responses correctly as messages will route to gateway UUID 00000000-0000-0000-0000-000000000000.

Recommendation: The conversion functions need to preserve or properly generate a valid gateway_id. Consider requiring v3+ for this feature or finding another way to handle v2 compatibility.

3. Potential Bug: Missing gateway_id in TypeScript

The TypeScript tunnel implementation doesn't appear to set gateway_id when sending messages:

// tunnel.ts:430-437
const message: protocol.ToServer = {
    tag: "ToServerTunnelMessage",
    val: {
        requestId,
        messageId,
        messageKind,
        // ⚠️ Missing gateway_id field
    },
};

Looking at the protocol definition, ToServerTunnelMessage requires gatewayId:

type ToServerTunnelMessage struct {
    gatewayId: GatewayId
    requestId: RequestId
    messageId: MessageId
    messageKind: ToServerTunnelMessageKind
}

Impact: TypeScript runners won't be able to send tunnel messages correctly.

Recommendation: The TypeScript code needs to be updated to include the gateway_id field. The runner should extract this from the incoming ToClientTunnelMessage.gatewayId field.

4. Performance: Unbounded pending_msgs Growth

The pending_msgs Vec in InFlightRequest (shared_state.rs:39) is never cleared in the new implementation since ACKs are removed:

req.pending_msgs.push(PendingMessage {
    message_id,
    send_instant: now,
});

Impact: For long-lived requests (especially WebSockets), this will cause memory growth proportional to the number of messages sent.

Recommendation: Either:

  • Remove the pending_msgs field entirely if it's no longer needed
  • Implement the keep-alive mechanism to clear old pending messages
  • Add a bounded capacity with eviction

5. Code Quality: Commented-Out Code

Large blocks of commented-out code should be removed:

  • shared_state.rs:228-249 (50+ lines)
  • The TODO comments suggest uncertainty about whether this functionality is still needed

Recommendation: Remove commented code. Git history preserves it if needed later.

🔍 Minor Issues

6. Inconsistent Error Handling in GC Logic

shared_state.rs:448-452 has inverted logic for timeout checking:

if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked{...});
}

This triggers GC when messages are <= MESSAGE_ACK_TIMEOUT, which seems backwards. Should be >=.

Update: Actually looking more carefully, this is inside a labeled block that returns None by default, so the logic may be intentionally inverted. However, it's confusing and deserves a comment.

7. TypeScript Type Safety

tunnel.ts:432-436 builds a message but doesn't include the new gatewayId field that's required by the protocol. TypeScript should catch this if the types are properly defined.

🧪 Testing Recommendations

  1. Protocol compatibility tests - Verify v2 clients can still communicate with v3 gateways
  2. Gateway routing tests - Ensure messages route to the correct gateway instance using gateway_id
  3. WebSocket lifecycle tests - Verify long-lived WebSocket connections don't leak memory
  4. Reconnection tests - Test gateway failover scenarios with the new routing mechanism

📊 Security Considerations

No major security concerns identified. The removal of ACK tracking simplifies the attack surface.

📝 Documentation Needs

Consider adding:

  • Migration guide for the protocol change
  • Explanation of why TunnelAck was removed and what replaces it
  • Documentation of the new gateway routing mechanism

Overall Assessment: This is a good simplification, but it has critical implementation gaps that need to be addressed before merging. The keep-alive TODOs and TypeScript gateway_id issue are blockers.

@MasterPtato MasterPtato force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from bb070a3 to bdf8c0e Compare November 20, 2025 22:38
@claude
Copy link

claude bot commented Nov 20, 2025

Code Review: Remove TunnelAck

This PR simplifies the tunnel protocol by removing the TunnelAck message type and related state tracking. Overall, the changes look good and represent a meaningful simplification. Here are my findings:

Strengths

  1. Architecture Improvement: Removing the ack mechanism and tunnel_active_requests tracking simplifies the state management significantly. This reduces complexity and potential bugs.

  2. Protocol Changes: The new gateway_id field in tunnel messages is a clean solution that allows the runner to determine the reply-to subject without maintaining state.

  3. Backward Compatibility: The version conversion functions properly handle the removed TunnelAck by converting it to abort messages, maintaining protocol compatibility during upgrades.

  4. Code Cleanup: Removal of ~50 lines of complex tracking logic in tunnel_to_ws_task.rs and the entire tunnel_active_requests HashMap improves maintainability.

🔍 Concerns & Recommendations

1. Missing Cleanup Logic (Critical)

Location: engine/packages/pegboard-runner/src/lib.rs:243-279 (removed)

The PR removes logic that sends ToServerResponseAbort messages to the gateway when the runner WebSocket closes. This cleanup was important for non-WebSocket (HTTP) requests.

Before: When a runner disconnected, all active HTTP requests received abort messages.
After: No abort messages are sent, potentially leaving the gateway waiting.

Impact: HTTP requests may hang or timeout instead of failing fast when a runner disconnects.

Recommendation: Consider whether this cleanup is still needed. If the gateway can detect runner disconnection through other means (e.g., watching the in_flight_requests map), document this. Otherwise, restore the cleanup logic.

2. TODOs Left in Code

Location:

  • engine/packages/pegboard-gateway/src/shared_state.rs:227-249
  • engine/packages/pegboard-runner/src/tunnel_to_ws_task.rs:57

Multiple TODO comments reference the old acknowledgment logic but are not implemented:

Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ...
}

Recommendation: Either implement the keep-alive logic or remove these TODOs if they're no longer needed. Commented-out code should generally be removed.

3. Gateway ID Initialization in Conversions

Location: engine/sdks/rust/runner-protocol/src/versioned.rs:178,493

When converting from v2 to v3, the gateway_id is initialized to [0; 16]:

gateway_id: [0; 16],

Concern: A zero UUID could cause issues if the code expects valid UUIDs.

Recommendation: Document whether zero UUIDs are acceptable for backward compatibility scenarios, or add validation to reject them.

4. Pending Messages Not Cleared

Location: engine/packages/pegboard-gateway/src/shared_state.rs:179-182

The gateway still maintains a pending_msgs vector and pushes messages to it, but the ack logic that clears this vector is commented out (line 227-249). This means:

  • Messages accumulate in memory indefinitely
  • The GC logic checks for timeout (line 447-453) but never clears pending messages

Impact: Memory leak for long-running connections with many messages.

Recommendation: Either implement the new keep-alive clearing logic or remove the pending_msgs vector entirely if acknowledgments are no longer tracked.

5. TypeScript SDK Changes

Location: engine/sdks/typescript/runner-protocol/src/index.ts

The TypeScript changes look mechanically correct, properly removing TunnelAck and adjusting tag indices. However:

  • Ensure client code doesn't depend on the removed TunnelAck functionality
  • The tag number changes mean old clients can't talk to new servers (breaking change)

Recommendation: Version this as a breaking change and ensure all TypeScript clients are updated.

📝 Minor Issues

  1. Unused Code: The wrapping_lt function is commented out (line 519-521 in shared_state.rs). Should be removed.

  2. Log Messages: Several debug logs still reference "ack" behavior that no longer exists (e.g., line 232).

  3. Error Handling: The conversion fallback for TunnelAck to ToClientRequestAbort (line 1234) seems arbitrary. Consider if this is the right default.

🎯 Summary

This is a good refactoring that simplifies the protocol. The main concerns are:

  1. The missing cleanup logic for active requests (needs investigation)
  2. TODOs and incomplete keep-alive implementation
  3. Potential memory leak from pending_msgs not being cleared

Verdict: ⚠️ Needs Minor Revisions - Address the critical cleanup logic and memory leak concerns before merging.

@claude
Copy link

claude bot commented Nov 20, 2025

PR Review: Remove TunnelAck (#3488)

Summary

This PR removes the TunnelAck acknowledgment mechanism from the pegboard gateway/runner protocol, simplifying the tunnel message flow between gateway and runner. The change involves protocol v3 updates and restructuring of message types.

Code Quality and Best Practices

Strengths ✅

  1. Clean protocol simplification: Removes bidirectional acknowledgment complexity, reducing latency and simplifying state management
  2. Proper versioning: Uses protocol version handling to maintain backwards compatibility through conversion functions
  3. Good logging: Uses structured logging with tracing throughout (e.g., tracing::debug!(?x, "message"))
  4. Consistent error handling: Proper use of anyhow::Result and context propagation
  5. Gateway ID introduction: Adding gateway_id to tunnel messages enables proper subject routing without maintaining active request state

Areas for Improvement 🔧

  1. TODO comments left in production code (shared_state.rs:228-249)

    • Large commented block for TunnelAck handling should either be removed or converted to proper TODO issues
    • Consider: Remove dead code or track with GitHub issue if planning to reimplement
  2. Backwards compatibility handling (versioned.rs:1231-1238, 1331-1338)

    • Converting deprecated TunnelAck to ToClientRequestAbort/ToServerResponseAbort is questionable
    • These are semantically different - an ack is not an abort
    • Consider: Log a warning when encountering v2 TunnelAck messages instead of silently converting
  3. WebSocket index handling (versioned.rs:1220)

    • Comment says "Default to 0 for v2 messages" but code now uses msg.index
    • Verify v2 protocol actually has this field or if this breaks backwards compatibility
  4. Missing import cleanup (conn.rs)

    • Removed TunnelActiveRequest struct but imports may still reference unused types
    • Run ./scripts/cargo/fix.sh to ensure clean build

Potential Bugs 🐛

Critical Issues

  1. GC logic timeout comparison inverted (shared_state.rs:448-452, 458-461)

    if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT {
        break 'reason Some(MsgGcReason::MessageNotAcked{...});
    }

    BUG: Should be >= not <=. Current code triggers timeout when messages are WITHIN the timeout window, not after exceeding it.

    Fix:

    if now.duration_since(earliest_pending_msg.send_instant) >= MESSAGE_ACK_TIMEOUT {

    Same issue at line 458-461 for WebSocket messages.

  2. Gateway ID zero-initialization (versioned.rs:178, 493)

    • When converting v2 to v3, gateway_id is set to [0; 16]
    • Zero UUID is semantically invalid and will cause routing failures
    • Impact: v2 clients cannot communicate through v3 gateway
    • Consider: Reject v2 messages or generate valid gateway IDs

Medium Priority

  1. Message ordering during hibernation (shared_state.rs:156-162)

    • WebSocket message index wrapping calculation: hs.last_ws_msg_index.wrapping_add(1)
    • Comment says "This ends up skipping 0 as an index when initiated but whatever"
    • Could cause index confusion if index 0 is special-cased elsewhere
  2. No request lifecycle cleanup (lib.rs:243-278 removed)

    • Removed code that sent close messages to active requests on runner disconnect
    • Now relies purely on GC timeouts
    • Impact: HTTP requests may hang for MESSAGE_ACK_TIMEOUT (30s) instead of immediate close
    • Consider: Add explicit close signaling or document timeout behavior

Performance Considerations ⚡

Improvements

  1. Reduced network overhead: No more ACK roundtrips saves ~50% of tunnel messages
  2. Less state tracking: Removed tunnel_active_requests HashMap reduces memory pressure
  3. Simpler message flow: Gateway no longer needs to track request lifecycle

Concerns

  1. No message delivery confirmation: Without ACKs, cannot detect message loss

    • Relies on TCP reliability, but pub/sub layer (NATS) could drop messages
    • Consider: Add application-level health checks or timeout monitoring
  2. Pending message GC complexity (shared_state.rs:379-504)

    • Two-phase GC commit is complex and runs every 15 seconds
    • Could accumulate pending messages if GC is slow
    • Recommendation: Add metrics for GC duration and pending message counts

Security Concerns 🔒

Low Risk

  1. Gateway ID routing - Using UUID from message for subject routing is safe if validated
  2. No new attack surface - Simplification generally improves security posture

Recommendations

  1. Validate gateway_id: Ensure non-zero UUID before routing messages
  2. Rate limiting: Without ACKs, no natural backpressure - ensure rate limits exist elsewhere

Test Coverage 🧪

Issues

  1. No test files modified - Breaking protocol change with no test updates
  2. Backwards compatibility untested - v2→v3 conversion should have tests
  3. GC logic untested - Complex timeout logic needs unit tests

Recommendations

  1. Add integration tests for v2/v3 protocol conversion
  2. Add unit tests for GC timeout logic (especially the timeout comparison bug!)
  3. Test hibernation WebSocket message index wrapping
  4. Test gateway_id routing with various UUID values

Breaking Changes ⚠️

This is a breaking protocol change requiring coordinated deployment:

  1. Gateway and runner must upgrade together
  2. v2 clients will have routing issues (zero gateway_id)
  3. Consider adding migration period with dual protocol support

Action Items

Must Fix Before Merge

  • Fix timeout comparison bug in GC logic (shared_state.rs:448-452, 458-461)
  • Address gateway_id zero-initialization for v2 compatibility
  • Remove or properly document TODO block (shared_state.rs:228-249)
  • Add tests for protocol conversion and GC logic

Should Fix

  • Improve backwards compatibility error handling
  • Add explicit request cleanup on runner disconnect
  • Verify WebSocket index 0 handling is correct
  • Add metrics for pending message GC

Nice to Have

  • Document deployment procedure for breaking change
  • Add health checks for message delivery
  • Consider adding configurable timeout values

Overall Assessment

Rating: ⚠️ Needs Work

The PR achieves good protocol simplification goals but has critical bugs that must be fixed before merge. The timeout comparison logic is inverted which will break message GC. Additionally, backwards compatibility with v2 needs clarification or rejection.

Once the critical issues are addressed and tests are added, this will be a solid improvement to the protocol.

@NathanFlurry NathanFlurry force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from bdf8c0e to bb070a3 Compare November 20, 2025 23:32
@NathanFlurry NathanFlurry force-pushed the 11-19-chore_engine-runner_remove_echo_close_on_toclientwebsocketclose branch from 5046fa3 to 95c2777 Compare November 20, 2025 23:32
@claude
Copy link

claude bot commented Nov 20, 2025

Pull Request Review: Remove TunnelAck

Overview

This PR removes the TunnelAck acknowledgment mechanism from the tunnel protocol, simplifying the communication flow between pegboard-gateway and pegboard-runner. The change replaces the previous request-response acknowledgment pattern with a stateless routing approach using gateway_id.

Architecture Changes

Protocol Simplification ✅

Good: The removal of TunnelAck simplifies the protocol significantly:

  • Eliminates bidirectional acknowledgment messages
  • Removes complex state tracking on both gateway and runner sides
  • Reduces message overhead and latency

New Routing Mechanism ✅

Good: The introduction of gateway_id provides a cleaner routing solution:

  • Messages now include a gateway_id field (UUID) to identify the destination gateway
  • Runner can directly construct the gateway reply-to subject without maintaining active request state
  • Gateway can be identified from the message itself rather than tracking request state

Code Quality

State Management Cleanup ✅

Excellent: Significant reduction in state complexity:

  • Removed tunnel_active_requests HashMap from Conn structure (pegboard-runner/src/conn.rs:25-28)
  • Removed TunnelActiveRequest struct and associated lifecycle management
  • Eliminated complex cleanup logic on connection close (lib.rs:246-278)

Gateway Changes ✅

Well-structured: Gateway changes are minimal and clean:

  • Added gateway_id field to SharedStateInner (shared_state.rs:62)
  • Gateway includes its ID in outgoing tunnel messages (shared_state.rs:165)
  • Simplified message deserialization by handling ToGatewayKeepAlive separately

Runner Changes ✅

Clean refactoring:

  • handle_tunnel_message now stateless - derives gateway subject from gateway_id (ws_to_tunnel_task.rs:368-369)
  • Removed complex logic for tracking terminal states
  • Removed is_to_server_tunnel_message_kind_request_close function (no longer needed)

Potential Issues

1. Incomplete Implementation ⚠️

Medium Priority: There are TODO comments indicating incomplete work:

// shared_state.rs:227-249
Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ...
}
// tunnel_to_ws_task.rs:55-56
protocol::ToRunner::ToRunnerKeepAlive(_) => {
    // TODO:
    continue;
}

Impact: Keep-alive messages are now ignored, which could affect reliability. The commented code suggests this was previously used for tracking pending messages and implementing retries.

Recommendation: Either implement the keep-alive handling or remove it entirely from the protocol if it's no longer needed.

2. Backwards Compatibility Handling ⚠️

Low Priority: The v2-to-v3 conversion for TunnelAck maps to Abort messages:

// versioned.rs:1232-1237, 1331-1338
v2::ToClientTunnelMessageKind::TunnelAck => {
    v3::ToClientTunnelMessageKind::ToClientRequestAbort
}

Concern: If any v2 clients actually send TunnelAck, they'll be converted to abort messages, which could cause unexpected connection terminations.

Recommendation: Document this breaking change or consider whether v2 protocol support can be dropped entirely.

3. TypeScript Code Not Updated 🔴

High Priority: The TypeScript stringify helpers still reference TunnelAck:

// engine/sdks/typescript/runner/src/stringify.ts:35-36, 77-78
case "TunnelAck":
    return "TunnelAck";

Impact: This will cause TypeScript compilation errors since TunnelAck no longer exists in the generated protocol types.

Recommendation: Update stringify.ts to remove the TunnelAck cases from both stringifyToServerTunnelMessageKind and stringifyToClientTunnelMessageKind functions.

4. Hibernation State Management ⚠️

Low Priority: The pending messages tracking is still present:

// shared_state.rs:179-182
let now = Instant::now();
req.pending_msgs.push(PendingMessage {
    message_id,
    send_instant: now,
});

Observation: Messages are still being tracked as pending, but the keep-alive/ack mechanism that would remove them from the pending list is not implemented (see TODO comments).

Recommendation: Clarify whether pending message tracking is still needed. If not, remove it to simplify the code further.

5. Missing gateway_id Field in v2 Conversion ⚠️

Low Priority: When converting from v2 to v3, gateway_id is set to all zeros:

// versioned.rs:177, 492
gateway_id: [0; 16],

Concern: This could cause routing issues if v2 clients are still active, as messages would be sent to gateway ID 00000000-0000-0000-0000-000000000000.

Recommendation: Either ensure v2 protocol is no longer supported, or implement proper gateway ID handling for v2 compatibility.

Performance Considerations

Positive Impact ✅

  • Reduced message count (no more acks for every message)
  • Eliminated HashMap lookups for request tracking on both sides
  • Reduced memory footprint (no more active request state)
  • Lower latency (no round-trip for acknowledgments)

Security Considerations

Gateway ID Validation ❓

Question: Is there any validation that the gateway_id in incoming messages matches an actual gateway?

Risk: A malicious actor could potentially send tunnel messages with arbitrary gateway IDs to cause messages to be routed incorrectly.

Recommendation: Consider adding validation that the gateway ID corresponds to an active, authorized gateway instance.

Testing Coverage

Missing Test Updates ⚠️

Observation: No test file changes are included in this PR.

Recommendation: Ensure that:

  1. Integration tests covering tunnel message flow are updated
  2. Tests verify that messages route correctly using gateway_id
  3. Edge cases (invalid gateway IDs, connection drops) are tested

Summary

Overall Assessment: Good refactoring that simplifies the protocol significantly. The removal of the acknowledgment mechanism reduces complexity and should improve performance.

Must Address Before Merge:

  1. 🔴 Fix TypeScript stringify.ts compilation errors
  2. ⚠️ Implement or remove keep-alive handling (decide on approach)
  3. ⚠️ Clarify v2 backward compatibility strategy

Recommended Follow-ups:

  • Add validation for gateway_id routing
  • Update tests to cover new message flow
  • Document protocol changes in architecture docs
  • Consider removing pending message tracking if no longer needed

Approval: Conditional - fix the TypeScript compilation issue first, then clarify the keep-alive and pending message tracking strategy.


Review generated by Claude Code

@claude
Copy link

claude bot commented Nov 20, 2025

PR Review: Remove TunnelAck

Summary

This PR removes the TunnelAck mechanism from the pegboard tunnel protocol, simplifying the bidirectional communication between gateway and runner. The change affects the protocol schema, Rust implementation, and TypeScript SDK.

Code Quality & Architecture

✅ Strengths

  1. Clean Protocol Simplification: The removal of the bidirectional ack mechanism is a good architectural decision that reduces complexity. The new approach using gateway_id for routing is more straightforward.

  2. Proper Versioning Handling: The versioned protocol conversion in versioned.rs correctly handles backwards compatibility by mapping old TunnelAck messages to appropriate default values (lines 1231-1239, 1331-1339).

  3. Consistent Error Handling: Rust error handling follows the project's patterns with proper context chaining using anyhow.

  4. Good Logging: Structured logging is used appropriately with tracing::debug, tracing::warn, etc.

🔍 Issues & Concerns

1. Critical: Incomplete TODO Implementation (shared_state.rs:227-249)

Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ...
}

Issue: The ToGatewayKeepAlive handler is completely commented out with a TODO. This means:

  • Pending messages are never being cleaned up from in_flight.pending_msgs
  • The new keep-alive mechanism is not functional
  • Could lead to memory leaks as pending messages accumulate

Recommendation: Either implement the keep-alive logic or remove this variant from the protocol if it's not needed yet.

2. Missing Field in Protocol Message (ws_to_tunnel_task.rs:430-437)

let message: protocol::ToServer = {
    tag: "ToServerTunnelMessage",
    val: {
        requestId,
        messageId,
        messageKind,
    },
};

Issue: The ToServerTunnelMessage now requires a gatewayId field (according to the schema v3.bare:282-287), but this code is not setting it. This will cause a compilation error or runtime panic.

Recommendation: Add the gateway_id field:

val: {
    gatewayId: self.#runner.__gatewayId(), // or similar
    requestId,
    messageId,
    messageKind,
},

3. Logic Inversion Error (tunnel.ts:776-783)

if (this.#runner.hasActor(actorId, generation)) {
    this.log?.warn({
        msg: "actor not loaded to send response, assuming gateway has closed request",
        // ...
    });
    return;
}

Issue: The condition appears to be inverted. It warns "actor not loaded" when the actor does exist (hasActor returns true). This should likely be !this.#runner.hasActor(...).

Same issue at line 825.

4. Dead Code in TypeScript (tunnel.ts:413-420)

// Store the pending message in the actor's map
const actor = this.getRequestActor(requestIdStr);
if (actor) {
    actor.pendingTunnelMessages.set(messageIdStr, {
        sentAt: Date.now(),
        requestIdStr,
    });
}

Issue: With TunnelAck removed, the pending tunnel messages are still being tracked but never cleaned up except through the garbage collector timeout. This creates unnecessary memory overhead.

Recommendation:

  • If tunnel message acknowledgment is no longer needed, remove this tracking entirely
  • If it's still needed for keep-alive, implement the cleanup in the keep-alive handler

5. Potential Memory Leak in Gateway GC (shared_state.rs:447-453)

if let Some(earliest_pending_msg) = req.pending_msgs.first() {
    if now.duration_since(earliest_pending_msg.send_instant)
        <= MESSAGE_ACK_TIMEOUT
    {
        break 'reason Some(MsgGcReason::MessageNotAcked{message_id:Uuid::from_bytes(earliest_pending_msg.message_id)});
    }
}

Issue: The condition checks if duration is <= timeout, which means it only creates a GC reason when the message is within the timeout period. This seems backwards - it should be > to trigger cleanup when messages are older than the timeout.

6. Missing Gateway ID Population (tunnel.ts)

The TypeScript SDK's #sendMessage method doesn't populate the gatewayId field that's now required in the protocol schema. This will cause serialization errors.

Protocol & Schema

v3.bare Schema Changes

  • ✅ Properly removes TunnelAck from both ToServerTunnelMessageKind and ToClientTunnelMessageKind
  • ✅ Adds GatewayId type and includes it in tunnel messages
  • ✅ Introduces ToRunner and ToGateway top-level unions with keep-alive variants
  • ⚠️ Breaking change is properly versioned as v3

Testing Recommendations

  1. Add integration tests for the new keep-alive mechanism
  2. Test edge cases around gateway disconnection and reconnection
  3. Verify memory leak prevention - run load tests to ensure pending messages don't accumulate
  4. Test backwards compatibility with v2 clients (if applicable)
  5. Verify WebSocket hibernation still works correctly without TunnelAck

Security Considerations

No significant security concerns identified. The gateway_id-based routing doesn't introduce new attack vectors.

Performance Considerations

  • ✅ Removing bidirectional acks should reduce message overhead
  • ⚠️ Potential memory leaks from unimplemented cleanup could degrade performance over time
  • ⚠️ GC timeout logic errors could cause premature or delayed cleanups

Action Items

Must Fix Before Merge

  1. Implement or remove ToGatewayKeepAlive handler (shared_state.rs:227-249)
  2. Fix missing gatewayId field in TypeScript SDK (tunnel.ts:430-437)
  3. Fix inverted logic in #sendResponse and #sendResponseError (tunnel.ts:776, 825)
  4. Fix GC timeout condition logic (shared_state.rs:449)

Should Address

  1. Decide whether to keep or remove pending tunnel message tracking in TypeScript
  2. Add tests for the new protocol behavior
  3. Document the protocol change and migration path

Overall Assessment

The core architectural change is sound, but the implementation is incomplete and has several bugs that will cause runtime failures. The most critical issues are:

  • Unimplemented keep-alive logic
  • Missing required protocol fields
  • Inverted conditional logic

Recommendation: Do not merge until the "Must Fix" items are addressed. The bugs are significant enough to cause immediate failures in production.

@claude
Copy link

claude bot commented Nov 21, 2025

Pull Request Review: Remove TunnelAck from Protocol

Summary

This PR removes the acknowledgment mechanism (TunnelAck) from the tunnel protocol, simplifying the communication flow between the pegboard gateway and runner. The changes introduce a new message type ToRunner and embed gateway IDs directly in tunnel messages.


Code Quality & Architecture

✅ Strengths

  1. Clean Protocol Simplification: Removing the TunnelAck mechanism reduces complexity in the bidirectional communication protocol. The previous approach of tracking active requests in both gateway and runner is replaced with a stateless design using gateway_id.

  2. Good Backward Compatibility Handling: The version conversion code properly handles the deprecated TunnelAck by converting it to ToClientRequestAbort / ToServerResponseAbort in the v2-to-v3 converters (lines 1231-1236 and 1331-1338 in versioned.rs).

  3. Type Safety Improvements: The introduction of ToRunner as a separate enum from ToClient provides better type safety and clearer separation of concerns between gateway-to-runner and general client messages.

  4. Proper Structured Logging: Uses tracing correctly with structured fields (?request_id, ?message_id) rather than formatting into strings.


Issues & Concerns

🔴 Critical: Dead Code & TODOs

Location: packages/pegboard-gateway/src/shared_state.rs:227-250

The ToGatewayKeepAlive handler contains only commented-out TODO code. This suggests incomplete implementation:

Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ...

Recommendation: Either:

  • Implement the keep-alive mechanism fully
  • Remove this variant if it's not needed yet
  • Document why it's deferred and create a tracking issue

🔴 Critical: Removed Safety Mechanism Without Replacement

Location: packages/pegboard-runner/src/lib.rs:246-276

The PR removes code that sends close messages to active HTTP requests when the runner WebSocket disconnects. This was important for cleaning up orphaned requests:

// Send close messages to all remaining active requests
let active_requests = conn.tunnel_active_requests.lock().await;
for (request_id, req) in &*active_requests {
    if req.is_ws { continue; }
    // Send ToServerResponseAbort...
}

Issue: With this removed, if a runner disconnects unexpectedly, the gateway may not be notified about pending HTTP requests that will never complete.

Recommendation:

  • Verify that the gateway has timeout mechanisms to handle this case
  • Add tests to ensure requests don't hang indefinitely when runners disconnect
  • Document the new cleanup strategy

⚠️ Security: Gateway ID Validation

Location: packages/pegboard-runner/src/ws_to_tunnel_task.rs:368-369

The runner now constructs the gateway reply-to subject directly from the gateway_id in the message:

let gateway_reply_to = GatewayReceiverSubject::new(Uuid::from_bytes(msg.gateway_id)).to_string();

Issue: There's no validation that this gateway ID is legitimate. A malicious actor could potentially:

  • Send messages to arbitrary gateway subjects
  • Cause routing issues or message interception

Recommendation:

  • Validate that the gateway_id matches the gateway that initiated the request
  • Consider signing or encrypting the gateway_id if untrusted clients can influence tunnel messages
  • Add documentation about the trust model

⚠️ Logic Issue: Incorrect Timeout Comparison

Location: packages/pegboard-gateway/src/shared_state.rs:448-452

if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked{...});
}

Issue: The condition is backwards. It creates a GC reason when the duration is less than or equal to the timeout, but it should create a reason when the duration exceeds the timeout.

Expected:

if now.duration_since(earliest_pending_msg.send_instant) >= MESSAGE_ACK_TIMEOUT {

Same issue exists at line 458-462 for WebSocket messages.

Impact: This will cause messages to be GC'd immediately after sending instead of after the timeout period.

⚠️ Performance: Redundant Cloning

Location: packages/pegboard-gateway/src/shared_state.rs:166

message_id: message_id.clone(),

Issue: message_id is a [u8; 16] which is Copy, so .clone() is redundant.

Recommendation: Remove .clone() for consistency and clarity.


Testing Concerns

Missing Test Coverage

  1. Protocol Version Conversion: The backward compatibility conversions (v2 → v3 with TunnelAck removal) should have tests verifying:

    • v2 messages with TunnelAck convert correctly to v3
    • v3 messages serialize/deserialize properly
  2. Gateway ID Routing: Tests should verify:

    • Messages are routed to the correct gateway using the embedded gateway_id
    • Invalid gateway IDs are handled gracefully
  3. Runner Disconnection: Tests should verify:

    • HTTP requests don't hang when runners disconnect
    • WebSocket connections clean up properly
    • Pending messages are handled correctly during disconnection
  4. Keep-Alive Mechanism: Once implemented, needs comprehensive testing

Test Recommendation

#[tokio::test]
async fn test_gateway_id_routing() {
    // Verify messages route to correct gateway
}

#[tokio::test]
async fn test_runner_disconnect_cleanup() {
    // Verify requests don't hang on disconnect
}

Protocol Schema Changes

v3.bare Schema

The schema changes look clean:

Good:

  • GatewayId type properly defined as data[16]
  • TunnelAck removed from both ToClient and ToServer message kinds
  • ToRunner union introduced for gateway-to-runner messages
  • ToGateway changed from struct to union for future extensibility

⚠️ Note: The TypeScript SDK shows significant changes (184 additions, 49 deletions). Ensure:

  • All TypeScript consumers are updated
  • Breaking changes are documented
  • Migration guide is provided if needed

Recommendations Summary

Must Fix Before Merge:

  1. Fix the timeout comparison logic (lines 448-452, 458-462 in shared_state.rs)
  2. Implement or remove ToGatewayKeepAlive handler with justification
  3. Verify request cleanup strategy when runner disconnects
  4. Add gateway ID validation or document trust model

Should Address:

  1. Add tests for protocol version conversion
  2. Add tests for gateway ID routing
  3. Add tests for runner disconnection scenarios
  4. Remove redundant .clone() on Copy types
  5. Document breaking changes for SDK consumers

Nice to Have:

  1. Add inline documentation explaining the new gateway_id flow
  2. Consider adding metrics for message routing success/failure
  3. Add tracing for keep-alive messages once implemented

Conclusion

This is a solid refactoring that simplifies the protocol by removing the acknowledgment mechanism. However, there are critical bugs (timeout comparison logic) and missing implementations (keep-alive, request cleanup) that need to be addressed before merging.

The architectural direction is sound, but the implementation needs completion and additional testing to ensure reliability.

Recommendation: Request changes to fix critical issues before merging.

@claude
Copy link

claude bot commented Nov 21, 2025

PR Review: Remove TunnelAck

Thanks for this cleanup! This PR simplifies the tunnel messaging protocol by removing the TunnelAck mechanism and instead using gateway_id to route messages directly. The overall approach is solid and the code changes look good. Here are my findings:

Strengths

  1. Protocol Simplification: Removing the two-way ack mechanism reduces complexity and overhead. The new approach using gateway_id for direct routing is cleaner.

  2. Proper Versioning: The changes handle backwards compatibility well through the versioned protocol system, with appropriate handling of the deprecated TunnelAck in the v2→v3 conversion.

  3. State Management Cleanup: Removed tunnel_active_requests tracking from the runner, which is no longer needed with the new routing approach.

  4. Consistent Error Handling: The code follows the project's error handling patterns using anyhow::Result and context messages.

🔍 Potential Issues

1. Incomplete Keep-Alive Implementation (Medium Priority)

In pegboard-gateway/src/shared_state.rs:227-250, there's commented-out TODO code for handling ToGatewayKeepAlive:

Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ...
}

Issue: The keep-alive mechanism was introduced but not fully implemented. This could affect the ability to track message delivery and clean up pending messages properly.

Recommendation: Either complete the keep-alive implementation or document why it's deferred and create a tracking issue.

2. Potential Message Ordering Issue (Low Priority)

In tunnel_to_ws_task.rs:78-86, the hibernating_request_ids are populated from the database, but the mutation of command_wrappers happens within a match arm:

protocol::ToRunner::ToClientCommands(mut command_wrappers) => {
    for command_wrapper in &mut command_wrappers {
        // ... mutates command_wrapper
    }
    protocol::ToClient::ToClientCommands(command_wrappers)
}

Issue: While this works correctly, the comment "NOTE: command_wrappers is mutated in this match arm" suggests this could be error-prone for future maintainers.

Recommendation: Consider extracting this logic into a separate function for clarity.

3. GC Logic May Not Remove Messages (High Priority)

In pegboard-gateway/src/shared_state.rs:447-453, the GC logic has inverted comparison logic:

if let Some(earliest_pending_msg) = req.pending_msgs.first() {
    if now.duration_since(earliest_pending_msg.send_instant)
        <= MESSAGE_ACK_TIMEOUT
    {
        break 'reason Some(MsgGcReason::MessageNotAcked{...});
    }
}

Issue: This checks if the duration is <= MESSAGE_ACK_TIMEOUT, but it should check if it's > or >= to trigger GC when the timeout has been exceeded. Same issue exists for WebSocket messages at lines 458-462.

Current behavior: Messages within the timeout period trigger GC (incorrect)
Expected behavior: Messages exceeding the timeout period should trigger GC

This is a critical bug that would cause premature cleanup of pending messages.

4. Dead Code in Versioning (Low Priority)

In versioned.rs:1231-1236, the backwards compatibility code converts v2's TunnelAck to v3's ToClientRequestAbort:

v2::ToClientTunnelMessageKind::TunnelAck => {
    // This shouldn't happen in practice as TunnelAck was removed
    v3::ToClientTunnelMessageKind::ToClientRequestAbort
}

Issue: While the comment says "this shouldn't happen", converting it to ToClientRequestAbort could cause unintended request termination if old messages are somehow processed.

Recommendation: Consider logging a warning or error if this code path is ever hit, or explicitly panic if it truly should never occur.

🎯 Performance Considerations

  1. Reduced Message Overhead: Removing bidirectional acks reduces network traffic and CPU overhead for serialization/deserialization. ✅

  2. Direct Routing: Using gateway_id for direct routing is more efficient than maintaining active request state on the runner side. ✅

  3. Pending Message Management: The existing timeout-based GC is reasonable, but the bug mentioned above needs fixing.

🔒 Security Considerations

  1. Gateway ID Validation: The code trusts gateway_id from messages without validation. Consider whether malicious actors could exploit this by sending messages with arbitrary gateway IDs to intercept traffic.

  2. Error Messages: The code properly uses structured logging and doesn't leak sensitive information in error messages. ✅

📋 Test Coverage

The PR doesn't include test changes. Consider adding:

  • Unit tests for the v2→v3 protocol conversion edge cases
  • Integration tests to verify messages are routed correctly with gateway_id
  • Tests for the timeout/GC logic (especially after fixing the comparison bug)

📝 Code Style & Best Practices

  1. Follows Repository Conventions: Uses structured logging with tracing, proper error context, and workspace dependencies. ✅

  2. Documentation: The protocol schema is well-documented. Consider adding more inline comments about the lifecycle of gateway_id routing.


Priority Recommendations

  1. [HIGH] Fix the inverted timeout comparison in GC logic (lines 447-453, 458-462)
  2. [MEDIUM] Complete or remove the ToGatewayKeepAlive TODO implementation
  3. [LOW] Add logging/error handling for the deprecated TunnelAck conversion path
  4. [LOW] Consider adding tests for protocol versioning and message routing

Overall, this is a good refactoring that simplifies the architecture. The main concern is the GC timeout bug which should be addressed before merging.


🤖 Generated with Claude Code

@claude
Copy link

claude bot commented Nov 21, 2025

Pull Request Review: Remove TunnelAck

Summary

This PR removes the TunnelAck acknowledgment mechanism from the runner protocol and replaces it with a simpler gateway ID-based routing approach. Overall, this is a significant architectural improvement that simplifies the tunnel message flow. The implementation is solid with good attention to backwards compatibility.

Strengths

1. Simplified Architecture

  • Before: Complex bidirectional ack system tracking pending messages
  • After: Cleaner unidirectional flow using gateway_id for routing
  • Removes approximately 140 lines of state management code
  • Eliminates the need to track active requests in pegboard-runner/src/conn.rs

2. Protocol Evolution

The new ToRunner and ToGateway union types provide better separation of concerns

3. Backwards Compatibility

Good handling where v2 TunnelAck messages are converted to abort messages

Issues and Concerns

CRITICAL: Incomplete Refactor - Dead Code

Location: pegboard-gateway/src/shared_state.rs:227-249

The ToGatewayKeepAlive handler contains commented-out TODO code. The pending_msgs field still exists but is never cleaned up, which will cause it to grow unbounded and potentially leak memory.

Recommendation: Either complete the ToGatewayKeepAlive implementation, remove pending_msgs entirely, or add a clear comment explaining the plan.

MODERATE: Missing gateway_id in TypeScript

Location: engine/sdks/typescript/runner/src/tunnel.ts:430-438

The TypeScript runner sendMessage method does not populate gateway_id in ToServerTunnelMessage, but the protocol shows gatewayId as a required field. This will likely cause serialization errors.

The TypeScript code needs to extract gateway_id from incoming messages, store it per-request, and include it in outgoing responses.

MODERATE: Index Conversion Logic Concern

Location: engine/sdks/rust/runner-protocol/src/versioned.rs:1220

Verify that v2 protocol schema actually has the index field, otherwise this will fail at runtime when deserializing v2 messages.

Action Items

Before Merge (Critical):

  • Resolve the ToGatewayKeepAlive TODO or remove pending_msgs entirely
  • Fix TypeScript runner to populate gateway_id in outgoing messages
  • Verify v2 ToClientWebSocketMessage.index field compatibility

Verdict

This is a good architectural improvement, but there are critical issues that must be addressed before merging. Once these are resolved, this PR will significantly simplify the tunnel protocol. Great work on maintaining backwards compatibility!

Generated with Claude Code

@graphite-app
Copy link
Contributor

graphite-app bot commented Nov 21, 2025

Merge activity

  • Nov 21, 2:40 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 21, 2:42 AM UTC: CI is running for this pull request on a draft pull request (#3515) due to your merge queue CI optimization settings.
  • Nov 21, 2:43 AM UTC: Merged by the Graphite merge queue via draft PR: #3515.

graphite-app bot pushed a commit that referenced this pull request Nov 21, 2025
@graphite-app graphite-app bot closed this Nov 21, 2025
@graphite-app graphite-app bot deleted the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch November 21, 2025 02:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants